An automated optical character recognition method is
provided for use in conjunction with a programmable
digital processing device. The method inputs a sequence
of values representing one or more characters in an
array of characters to be optically recognized. The
values define one or more dimensional characteristics
of the characters. From the input values, a standard
dimensional value is determined from a frequency
distribution of a selected one of the character
dimensional characteristics. For each of the input
characters, a set of normalized values is determined
from the standard dimensional value. The normalized
values correspond to the one or more character
dimensional characteristics. Optical character
recognition is thereafter performed using the
normalized values.

The present invention is directed generally to
optical character recognition, and more particularly, 
to automated methods and apparatus for reducing 
recognition errors, especially those resulting from 
an inability to distinguish between upper and lower 
case characters, or other characters of similar 
shape but dissimilar size or position. 

Programmable computers and digital process- 
ing apparatus have proven useful for optical char- 
acter recognition in which visual character indicia, 
such as printed text, are scanned, identified and 
assigned a character code value that can be stored 
electronically. A word processing "document" is 
one example of a data file containing character 
code values which a computer is able to interpret 
and reproduce in human-readable form on a CRT 
or as a printed document. There are many char- 
acter code conventions in use today, the most 
common being the ASCII (American Standard Code 
for Information Interchange) code system. 

Many existing optical character recognition 
systems make recognition errors between char- 
acters that are very similar in shape but are of 
different size or located at different positions. Upper
and lowercase characters (S/s) and apostrophes
and commas ('/,), for example, are prone to
such errors. No matter how similar the shapes, 
their size or position is usually so different that this 
kind of error must be avoided. 

Although some optical character recognition 
systems utilize information on size and position for 
discrimination, they still suffer from recognition er- 
rors, particularly when encountering the many 
kinds of electronic fonts currently in use. Thus, it 
would be desirable to provide a system that utilizes 
size and position information in a fashion that en- 
hances the speed and recognition rate of optical 
character recognition. 

The foregoing and other objectives are 
achieved by the invention as claimed; accordingly, 
an automated optical character recognition method 
of novel design is provided for use in conjunction 
with a programmable digital processing device. 
The method inputs a sequence of values represent- 
ing one or more characters in an array of char- 
acters to be optically recognized. The input values 
define one or more dimensional characteristics of 
the characters. From the input values, a standard 
dimensional value is determined from a frequency 
distribution of a selected one of the character di- 
mensional characteristics. For each of the input 
characters, a set of normalized values is deter- 
mined from the standard dimensional value. The 
normalized values correspond to the one or more 
character dimensional characteristics. Optical char- 
acter recognition is thereafter performed using the 
normalized values. 



The objects, advantages and features of the 
present invention will be more clearly understood 
by reference to the following detailed disclosure 
and the accompanying drawing in which: 
Fig. 1 is a block diagram of an automated optical
character recognition system constructed in
accordance with the present invention;
Fig. 2 is an illustration of a line of printed text
to be recognized;
Fig. 3 illustrates the size parameters measured
on a text line;
Figs. 4A and 4B are a flow diagram showing an
automated optical character recognition method
performed in accordance with the present invention; and
Fig. 5 is a frequency distribution graph showing
quantized character height versus the frequency
of the quantized height.
Referring now to Fig. 1, optical character recognition
in accordance with the present invention may be
performed by the illustrated data processing apparatus,
which includes a programmable computer 10 having a
keyboard (not shown), a data storage resource 20, a
display monitor 30 and an optical scanning device 40.
These components are conventionally known and may
include a wide variety of component types and system
arrangements. The data processing apparatus is
controlled by an OCR software system 50 which is
resident during program operation in random access
memory within the programmable computer. When the
software system 50 is not operational, it is maintained
within the data storage resource 20, which may be
considered to generically represent one or more of a
variety of data storage devices including floppy
magnetic disks, optical disks, magnetic tapes, portable
hard disks, and other apparatus.

As described in more detail below, the software
system 50 includes an executable instruction set
for controlling the data processing apparatus for
automatic recognition of characters formed as an
array of characters on a sheet of printed text
representing an input document 60. Figs. 2 and 3
illustrate one line 70 of printed text which might
appear in the input document 60 and require character
recognition.

Figs. 4A and 4B illustrate the steps to be
performed in the optical character recognition
method of the present invention. The method enhances
the speed and recognition rate required for
optically recognizing the characters in a line of text
by dynamically normalizing such attributes as character
height, character width and character position
relative to a baseline reference using a standard
attribute determined from the characters themselves.
Step 80 of the method begins with a bit map file as
input. The bit map file may exist in the data storage
resource 20 or may be generated by and input directly
from the optical scanning device 40. The optical
scanning device 40 can be selected from several
commercially available products capable of producing a
bi-level (binary) image or "bit map" at a continuity
that preserves visual quality when the image is
displayed on a screen or reprinted on paper. The binary
image or bit map is stored in the computer's random
access memory, and/or the data storage resource 20, and
defines a two dimensional array of 0's and 1's
representing white and black, respectively. The 0's and
1's are in one to one correspondence with the cells of a
grid that can be imagined to overlay the printed
page. Because automatic methods for the interpretation
of text are less discriminating than human
recognition processes, the number of cells per unit
area has to be greater than, for example, the
resolution used in facsimile transmission of documents,
which typically does not exceed 200 cells per inch.
Resolutions of 300 samples per inch are suitable,
however, and character recognition is not
significantly improved with higher resolutions.

In step 90 of Fig. 4A, the OCR software
system 50 operates on the input bit map data
structure by segmenting the bit map array into
single horizontal text lines for processing one line
of characters at a time. Each line is processed as a
sequence of binary input values representing one
or more characters extracted from the array of
characters resident on the input document 60 as a
whole. The binary input values, being part of a
two-dimensional array definition of the input
characters, necessarily define one or more dimensional
characteristics of the characters, such as character
height, character width and character position
relative to a baseline reference.
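
The text does not say how step 90 separates the lines. As a minimal sketch, assuming a simple horizontal projection (a run of consecutive rows containing at least one black cell is treated as one text line), the segmentation could look like the following. The function name `segment_lines` and the tiny sample bit map are hypothetical and serve only as an illustration.

```python
from typing import List, Tuple


def segment_lines(bitmap: List[List[int]]) -> List[Tuple[int, int]]:
    """Return inclusive (top_row, bottom_row) ranges of rows that contain ink.

    Assumption: text lines are separated by at least one all-white row.
    """
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = any(row)  # 1 = black, 0 = white
        if has_ink and start is None:
            start = y  # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y - 1))  # line ended on the previous row
            start = None
    if start is not None:  # line runs to the bottom edge of the bit map
        lines.append((start, len(bitmap) - 1))
    return lines


page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
print(segment_lines(page))  # [(1, 2), (4, 4)]
```

Skewed or touching lines would defeat this simple rule; it is shown only to make the notion of "single horizontal text lines" concrete.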

The baseline reference is calculated after text
line segmentation in step 100 of Fig. 4A as a
theoretical line passing through the bottom of a
predetermined percentage of the characters. Fig. 3
illustrates a baseline reference 102 that has been
estimated by the OCR software system 50. There
are various known methods which may be utilized
to calculate such a baseline and any one method
could be utilized in conjunction with the present
invention. Once the baseline information is obtained,
character position (starting from the bottom
of the characters) can be determined relative to the
baseline as a character attribute in addition to
character height and width. The nature of these
dimensional characteristics is graphically illustrated
in Figs. 2 and 3. Fig. 2 illustrates the single line of
text 70 reading "Speed and Recognition Enhancement
Using Normalized Height/Width Position."
Fig. 3 is a partially enlarged view of the text line 70
of Fig. 2. It illustrates the baseline reference 102,
an uppercase height attribute (Hu) 104, a lowercase
height attribute (Hl) 106 and a width attribute (W)
108.
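
The baseline method is left open above. As one illustrative possibility, and not necessarily the method used by the OCR software system 50, the baseline can be taken as the most frequent bottom row among the characters of a line. The names `estimate_baseline` and `position_above_baseline`, and the sample coordinates, are assumptions made for this sketch; image rows are assumed to grow downward.

```python
from collections import Counter
from typing import List


def estimate_baseline(bottoms: List[int]) -> int:
    """Most frequent bottom row among the characters of one text line."""
    return Counter(bottoms).most_common(1)[0][0]


def position_above_baseline(bottom: int, baseline: int) -> int:
    """Rows between the character bottom and the baseline.

    Positive for raised marks (apostrophe), negative for descenders (comma),
    because row indices are assumed to increase toward the page bottom.
    """
    return baseline - bottom


# Bottom rows of six characters: four letters, one descender, one apostrophe.
bottoms = [40, 40, 46, 40, 40, 28]
baseline = estimate_baseline(bottoms)
positions = [position_above_baseline(b, baseline) for b in bottoms]
print(baseline, positions)  # 40 [0, 0, -6, 0, 0, 12]
```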

In step 110 of Fig. 4A, the maximum character 
height, character width and character position relative
to the baseline reference are calculated for each
character of the input character line. The next step 
is to normalize these attribute values so that mean- 
ingful recognition comparisons can be made rela- 
tive to a reference library database. Normalization 
is performed using a "standard" attribute such as 
character height. Standard height may be thought 
of as an estimate of the height of typical upper 
case characters. This estimation is based on the 
"mode" in a statistical sense, which is used in 
preference to the highest value or an average value 
of the text because the heights of ordinary upper 
case, numeric, and some lower case characters 
(such as "b","d" and "h") have a small variation 
and are fairly stable. It is preferable not to use a 
highest value because characters such as "/", "f" 
and "Q" have a larger variation and sometimes do 
not appear in a text line/paragraph/page. The aver- 
age height is also unreliable because the value 
changes according to the relative frequency of 
characters. 

At the point where standard height is calcu- 
lated, the text baselines have been estimated and 
the size and position of the character patterns are 
known. Thus each character is represented for 
purposes of this stage by three positional measurements:
w = width, h = height, and b = distance of the
character bottom above the baseline. Standard 
height is calculated from a sequence of text as- 
sumed to be printed in only one font. The process 
thus estimates standard height for each line of text 
although standard height for a paragraph or page 
of text could also be used. If a line contains only a 
relatively small amount of text printed in a secon- 
dary font, the estimate of standard height need not 
be greatly affected, since the calculation is statisti- 
cal in nature, as explained below. 

The first step 120 (see Fig. 4A) in the calcula- 
tion of standard height is to construct a histogram 
of character height. A size constraint is imposed so 
that very small characters such as hyphens are 
eliminated from the count. The result is a distribu- 
tion graph 122, shown in Fig. 5, in which the "x" 
axis represents quantized character height ("h") 
and the "y" axis represents the frequency of occur- 
rence ("f(h)") of the quantized height. 
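
A minimal sketch of step 120 follows. The cutoff `min_height` stands in for the size constraint, whose actual value is not given in the text, and the function name `height_histogram` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def height_histogram(heights: List[int], min_height: int = 5) -> Dict[int, int]:
    """Build f(h): the number of characters at each quantized height h.

    Characters shorter than min_height (an assumed threshold) are skipped,
    so that marks such as hyphens do not enter the count.
    """
    return dict(Counter(h for h in heights if h >= min_height))


# Heights in pixels for one line; the 2-pixel entry plays the role of a hyphen.
heights = [20, 14, 14, 20, 2, 21, 14, 15, 20]
f = height_histogram(heights)
print(f)  # {20: 3, 14: 3, 21: 1, 15: 1}
```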

In step 130 (see Fig. 4A), this distribution is 
smoothed by summation to generate: 

F(h) = f(h-1) + f(h) + f(h + 1) 

The hypothesis that governs the remainder of the 
normalization process is that the smoothed histo- 
gram F(h) is multimodal. Its two major peaks 124 
and 126 (if both are present) correspond, respectively,
to small lower case letters such as 'a' or 'e' at a
smaller value of height (peak 124), and to tall
characters such as capital letters or letters with
ascenders at a larger value of height (peak 126).
The process must also take into account the pos- 
sibility that the text is printed in upper case only, 
i.e., that there is no lower case peak. 
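
The three-point summation F(h) = f(h-1) + f(h) + f(h+1) can be sketched as below. The function name `smooth` and the sample counts are illustrative assumptions; heights missing from f are treated as having a count of zero.

```python
from typing import Dict


def smooth(f: Dict[int, int]) -> Dict[int, int]:
    """Return F(h) = f(h-1) + f(h) + f(h+1) over the occupied height range.

    The range is widened by one bin on each side, since those bins also
    receive a contribution from their neighbour.
    """
    if not f:
        return {}
    lo, hi = min(f) - 1, max(f) + 1
    return {h: f.get(h - 1, 0) + f.get(h, 0) + f.get(h + 1, 0)
            for h in range(lo, hi + 1)}


# Sample f(h): a lower case cluster near 14 and an upper case cluster near 20.
f = {13: 1, 14: 6, 15: 1, 19: 1, 20: 4, 21: 1}
F = smooth(f)
print(F[14], F[20], F[17])  # 8 6 0
```

In this sample the smoothed histogram has two peaks, at heights 14 and 20, with an empty valley between them.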

The major peaks are identified in step 140 of 
Fig. 4A. To do so, an initial peak h1 is first detected
by determining the global maximum of F(h).
This peak may be either the desired upper case 
peak (i.e., the desired standard height) or else the 
lower case peak. The peak identification step 140 
thus proceeds to search for a second lower case 
peak h2 in the region: 

h1 * (1 - d2) < h2 < h1

and a third upper case peak h3, in the region: 

h1 < h3 < h1 * (1 + d3)

The value d3 is a predetermined positive param- 
eter representing an estimate of the largest factor 
by which upper case height can exceed lower case 
height over the range of fonts considered. This 
implies the relation: 

d2 = d3/(1 + d3) 

because 1/(1 - d2) = (1 + d3)/1
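
The search regions can be sketched as follows. The text does not say how a secondary peak is told apart from the sloping shoulder of h1, so this sketch assumes one simple rule: walk outward from h1 while F keeps falling, then take the highest remaining point in the region. The value d3 = 0.5, the tie-break toward the taller height, and the names `find_peaks` and `_secondary_peak` are likewise assumptions.

```python
from typing import Dict, List, Optional, Tuple


def _secondary_peak(F: Dict[int, int], h1: int, region: List[int]) -> Optional[int]:
    """Highest point of F in region (ordered outward from h1), past h1's shoulder."""
    prev = F[h1]
    i = 0
    # Skip heights where F is still falling away from the main peak.
    while i < len(region) and F.get(region[i], 0) <= prev:
        prev = F.get(region[i], 0)
        i += 1
    rest = region[i:]
    return max(rest, key=lambda h: F.get(h, 0)) if rest else None


def find_peaks(F: Dict[int, int], d3: float = 0.5) -> Tuple[int, Optional[int], Optional[int]]:
    """Return (h1, h2, h3); h2 or h3 is None when no such peak is found."""
    d2 = d3 / (1 + d3)  # relation given in the text
    h1 = max(F, key=lambda h: (F[h], h))  # global maximum; ties go to the taller height
    lower = [h for h in range(h1 - 1, min(F) - 1, -1) if h > h1 * (1 - d2)]
    upper = [h for h in range(h1 + 1, max(F) + 1) if h < h1 * (1 + d3)]
    return h1, _secondary_peak(F, h1, lower), _secondary_peak(F, h1, upper)


# Smoothed sample histogram: lower case peak at 14, upper case peak at 20.
F = {12: 1, 13: 7, 14: 8, 15: 7, 16: 1, 17: 0, 18: 1, 19: 5, 20: 6, 21: 5, 22: 1}
print(find_peaks(F))  # (14, None, 20)
```

With d3 = 0.5 the relation gives d2 = 1/3, so for h1 = 14 the lower region is 9.33 < h2 < 14 and the upper region is 14 < h3 < 21.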

Once found, the proper peak to use for stan- 
dard height is calculated in step 150 of Fig. 4A by 
comparing the amplitude of the identified peaks
("F") to the total character count "N". To qualify as
a valid peak, F(h2) or F(h3) must be at least 10%
(or some other reliable percentage) of N. In view of
the definition of F(h), this means that at least 10% 
of the character heights must be within plus or 
minus one pixel of h2 or h3. If h3 passes this test 
and has amplitude greater than that of h2, it is 
selected for standard height. Otherwise h1 is returned
for standard height.
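
The selection rule of step 150 can be sketched as below. Treating a missing or invalid h2 as having zero amplitude is an interpretation made for this sketch, not something the text states, and the function name `select_standard_height` is hypothetical.

```python
from typing import Dict, Optional


def select_standard_height(F: Dict[int, int], h1: int, h2: Optional[int],
                           h3: Optional[int], n_chars: int,
                           min_fraction: float = 0.10) -> int:
    """Pick h3 if it is a valid peak that outweighs h2; otherwise return h1."""
    threshold = min_fraction * n_chars  # "at least 10% of N"
    amp2 = F.get(h2, 0) if h2 is not None else 0
    amp3 = F.get(h3, 0) if h3 is not None else 0
    if amp2 < threshold:
        amp2 = 0  # assumption: an invalid lower case peak does not compete
    if h3 is not None and amp3 >= threshold and amp3 > amp2:
        return h3
    return h1


# Smoothed sample histogram for a line of 14 counted characters.
F = {12: 1, 13: 7, 14: 8, 15: 7, 16: 1, 17: 0, 18: 1, 19: 5, 20: 6, 21: 5, 22: 1}
print(select_standard_height(F, h1=14, h2=None, h3=20, n_chars=14))  # 20
```

Here the global maximum h1 = 14 is the lower case peak, and the valid upper case peak at 20 is returned as the standard height. For text printed in upper case only, h2 and h3 are absent and h1 is returned unchanged.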

Once the standard height is obtained for a text 
line/paragraph/page, a set of normalized 
height/width/position values are calculated for each 
character in steps 160, 170 and 180 of Fig. 4B, as 
follows: 

• normalized height (NH) = height/standard 
height 

• normalized width (NW) = width/standard 
height 

• normalized position (NP) = (1/2 * height + 
bottom above base line)/standard height. 
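
The three formulas translate directly into code. In the sketch below the `CharBox` type, the `normalize` function and the pixel measurements are illustrative assumptions; the measurements are chosen to show how characters of similar shape separate after normalization.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CharBox:
    width: float    # w
    height: float   # h
    bottom: float   # b: distance of the character bottom above the baseline


def normalize(c: CharBox, standard_height: float) -> Tuple[float, float, float]:
    """Return (NH, NW, NP) for one character."""
    nh = c.height / standard_height
    nw = c.width / standard_height
    np_ = (0.5 * c.height + c.bottom) / standard_height  # centre height above baseline
    return nh, nw, np_


standard_height = 20.0
samples = {
    "S": CharBox(width=12, height=20, bottom=0),
    "s": CharBox(width=10, height=14, bottom=0),
    "'": CharBox(width=3, height=6, bottom=14),
    ",": CharBox(width=3, height=6, bottom=-3),
}
for label, box in samples.items():
    print(label, normalize(box, standard_height))
```

With these sample values, "S" and "s" differ in NH (1.0 against 0.7), while the apostrophe and comma share NH and NW and differ only in NP (0.85 against 0.0).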

The normalized character values can be used as
input to a pre-generated character recognition
library of conventional design in step 190 of Fig. 4B
for generating a character code output set to
complete the optical recognition process in step 200 of
Fig. 4B. The optical recognition process implemented
using the normalized character values can
take several forms. It is assumed that each character
has some n-dimensional feature and it is
classified by comparing the feature with a
pre-generated recognition library (each library member
or "template" has an n-dimensional feature and its
category). The comparison could be made using
Euclidean distance or by other known methods. The
n-dimensional feature could also be replaced by
other kinds of features such as a geometrical
feature.
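
As a minimal sketch of the Euclidean-distance comparison, assuming the library is a list of (category, feature) pairs, a character can be assigned the category of its nearest template. The function name `classify`, the two-template library and the feature values are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

Template = Tuple[str, Sequence[float]]  # (category, n-dimensional feature)


def classify(feature: Sequence[float], library: List[Template]) -> str:
    """Return the category of the template nearest to feature."""
    category, _ = min(library, key=lambda t: math.dist(feature, t[1]))
    return category


library = [
    ("S", (0.9, 0.1, 0.8)),
    ("O", (0.1, 0.9, 0.2)),
]
print(classify((0.8, 0.2, 0.7), library))  # S
```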

The n-dimensional feature vector used for comparisons
between the input characters and the recognition
library can be augmented using the normalized
character values determined above as an
extension of the vector. However, the normalized
character values are perhaps best used to pre-screen
the library recognition template patterns to
eliminate patterns unlikely to yield a positive
n-dimensional comparison. The recognition library is
configured to include six additional features for use
in pre-screening comparisons. They are: minimum
normalized height, maximum normalized height,
minimum normalized position, maximum normalized
position, minimum normalized width and maximum
normalized width. First, the normalized
height, normalized position and normalized width
are compared with the minimum and maximum
values of each template (prototype) in the library
(the order H-P-W is preferable, because width has
the most variations and is least reliable among the
three). If the value of the input character does not
satisfy the pre-screening conditions, the template is
immediately ignored. If the six comparisons are
satisfied, the comparison of n-dimensional features
is executed for detailed classification. It has been
observed that the six parameter comparison reduces
the number of candidates very quickly to
less than half. It also excludes upper/lower case
and apostrophe/comma confusion and increases
the recognition rate.
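
The two-stage comparison can be sketched as follows, assuming each template stores its feature vector together with the six range limits. The `Template` layout, the names `in_range` and `classify`, and all numeric ranges are hypothetical. The range checks run in the order H, P, W and stop at the first failure.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Range = Tuple[float, float]  # (minimum, maximum)


@dataclass
class Template:
    label: str
    feature: Tuple[float, ...]  # n-dimensional shape feature
    nh_range: Range
    np_range: Range
    nw_range: Range


def in_range(value: float, bounds: Range) -> bool:
    return bounds[0] <= value <= bounds[1]


def classify(nh: float, np_: float, nw: float, feature: Tuple[float, ...],
             library: List[Template]) -> Optional[str]:
    """Pre-screen on NH, NP, NW, then pick the nearest surviving template."""
    survivors = [t for t in library
                 if in_range(nh, t.nh_range)       # height first
                 and in_range(np_, t.np_range)     # then position
                 and in_range(nw, t.nw_range)]     # width last (least reliable)
    if not survivors:
        return None
    return min(survivors, key=lambda t: math.dist(feature, t.feature)).label


# "S" and "s" share a shape feature, so only the pre-screen can separate them.
library = [
    Template("S", (0.5, 0.5), (0.9, 1.1), (0.45, 0.55), (0.4, 0.8)),
    Template("s", (0.5, 0.5), (0.6, 0.8), (0.30, 0.40), (0.4, 0.7)),
]
print(classify(0.7, 0.35, 0.5, (0.5, 0.5), library))  # s
```

Because `and` short-circuits, a template that fails the height test is dropped without the position or width being examined, which mirrors the "immediately ignored" behaviour described above.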

Accordingly, a speed and recognition enhancement
method for optical character recognition has
been described. While various embodiments have
been disclosed, it should be understood that many
variations and alternative embodiments will be
apparent to those skilled in the art in view of the
teachings herein.

Claims 

1. An automated optical character recognition
method for use on a programmable digital
processing device, comprising the steps of:

selecting a sequence of input values
representing one or more characters in an array
of characters to be optically recognized, said
input values defining one or more dimensional
characteristics of said characters;

generating a standard dimensional value
determined from a frequency distribution of a
selected one of said character dimensional
characteristics;

generating for each of said characters a
set of normalized values determined from said
standard dimensional value, said normalized
values corresponding to said one or more
character dimensional characteristics; and

performing optical character recognition
using said normalized values.

2. The method of claim 1 wherein said one or
more characters are arranged on a single line
of printed text.

3. The method of claim 1 or 2 wherein said input
values are bit map representations of said
characters.

4. The method of any preceding claims wherein
said dimensional characteristics include character
height, character width and character position
relative to a baseline reference.

5. The method of any preceding claims wherein
said standard dimensional value is a value
representing standard character height.

6. The method of any preceding claims wherein
said frequency distribution is determined as a
series of summations representing a frequency
of occurrence of each character height value
determined from said input values.

7. The method of claim 1 wherein standard
dimensional value is selected from said frequency
distribution as a standard uppercase character
height, or if an insubstantial number of
uppercase characters are present, as a standard
lowercase character height.

8. The method of any preceding claims wherein
said normalized values include normalized
character height, normalized character width,
and normalized character position relative to a
baseline reference.

9. The method of any preceding claims wherein
said optical character recognition includes a
first classification step of comparing said
normalized values with corresponding values in a
library of character prototype templates.

10. The method of claim 9 wherein said optical
character recognition further includes a second
classification step of comparing additional
character features with corresponding features
in said library prototype template.

11. The method of claim 10 wherein first
classification step includes comparing said normalized
values with corresponding value ranges in said
library prototype template.

12. The method of claim 11 wherein said normalized
values include normalized character
height (NH), normalized character width (NW)
and normalized character position (NP) relative
to a baseline reference, and wherein said first
classification step includes comparing said
normalized values with predetermined minimum
and maximum library value ranges (minimum NH,
maximum NH, minimum NW, maximum NW,
minimum NP and maximum NP) to
determine whether said normalized values fall
within the minimum and maximum value
ranges of said library prototype template.

13. The method of claim 12 wherein said normalized
values are compared with said minimum
and maximum value ranges in the order of NH,
NP and NW, with the library prototype template
being ignored upon one of said normalized
values falling outside of said minimum
and maximum value range.

14. The method of claim 13 wherein said second
classification step compares selected features
of said characters with a feature vector
corresponding to said library prototype template.

15. The method of claim 14 wherein said selected
features are n-dimensional features and are
compared to said library feature vector using a
distance method.

16. An automated optical character recognition
system characterized by comprising:

means for selecting a sequence of input
values representing one or more characters in
an array of characters to be optically recognized,
said input values defining one or more
dimensional characteristics of said characters;

means for generating a standard dimensional
value determined from a frequency distribution
of a selected one of said character
dimensional characteristics;

means for generating for each of said
characters a set of normalized values determined
from said standard dimensional value,
said normalized values corresponding to said
one or more character dimensional characteristics;
and

means for performing optical character
recognition using said normalized values.

17. The system of claim 16 wherein said one or
more characters are arranged on a single line
of printed text.

18. The system of claim 16 or 17 wherein said
input values are bit map representations of
said characters.

19. The system of any claim from 16 to 18
wherein said dimensional characteristics include
character height, character width and
character position relative to a baseline reference.

20. The system of any claim from 16 to 19
wherein said standard dimensional value is a
value representing standard character height.

21. The system of any claim from 16 to 20
wherein said frequency distribution is determined
as a series of summations representing
a frequency of occurrence of each character
height value determined from said input values.

22. The system of claim 16 wherein standard
dimensional value is selected from said frequency
distribution as a standard uppercase character
height, or if an insubstantial number of
uppercase characters are present, as a standard
lowercase character height.

23. The system of any claim from 16 to 22
wherein said normalized values include normalized
character height, normalized character
width, and normalized character position relative
to a baseline reference.

24. The system of any claim from 16 to 23
wherein said means for performing optical
character recognition includes a first classification
means for comparing said normalized values
with corresponding values in a library of
character prototype templates.

25. The system of claim 24 wherein said means
for performing optical character recognition further
includes a second classification means for
comparing additional character features with
corresponding features in said library prototype
template.

26. The system of claim 25 wherein first
classification means includes means for comparing said
normalized values with corresponding value
ranges in said library prototype template.

27. The system of claim 26 wherein said normalized
values include normalized character
height (NH), normalized character width (NW)
and normalized character position (NP) relative
to a baseline reference, and wherein said first
classification means includes means for comparing
said normalized values with predetermined
minimum and maximum library value
ranges (minimum NH, maximum NH, minimum
NW, maximum NW, minimum NP and maximum
NP) to determine whether said normalized
values fall within the minimum and maximum
value ranges of said library prototype
template.

28. The system of claim 27 wherein said normalized
values are compared with said minimum
and maximum value ranges in the order of NH,
NW and NP, with the library prototype template
being ignored upon one of said normalized
values falling outside of said minimum
and maximum value range.

29. The system of claim 28 wherein said second
classification means includes means for comparing
selected features of said characters with
a feature vector corresponding to said library
prototype template.

30. The system of claim 29 wherein said selected
features are n-dimensional features and are
compared to said library feature vector using a
distance method.